Take-home_Ex03

Author

LIANG YAO

Published

June 17, 2023

Modified

June 10, 2023

1. Task and Questions:

Objectives:

FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation’s database on fishing related companies. In the past, FishEye has determined that companies with anomalous structures are far more likely to be involved in IUU (or other “fishy” business). FishEye has transformed the database into a knowledge graph. It includes information about companies, owners, workers, and financial status. FishEye is aiming to use this graph to identify anomalies that could indicate a company is involved in IUU.

FishEye analysts have attempted to use traditional node-link visualizations and standard graph analyses, but these were found to be ineffective because the scale and detail in the data can obscure a business’s true structure. Can you help FishEye develop a new visual analytics approach to better understand fishing business anomalies?

Questions:

Use visual analytics to understand patterns of groups in the knowledge graph and highlight anomalous groups.

  1. Use visual analytics to identify anomalies in the business groups present in the knowledge graph. Limit your response to 400 words and 5 images.

  2. Develop a visual analytics process to find similar businesses and group them. This analysis should focus on a business’s most important features and present those features clearly to the user. Limit your response to 400 words and 5 images.

  3. Measure similarity of businesses that you group in the previous question. Express confidence in your groupings visually. Limit your response to 400 words and 4 images.

  4. Based on your visualizations, provide evidence for or against the case that anomalous companies are involved in illegal fishing. Which business groups should FishEye investigate further? Limit your response to 600 words and 6 images.

2. Load packages and data:

Show the code
pacman::p_load(igraph, ggraph, visNetwork, tidyverse, graphlayouts, jsonlite, heatmaply, stringr, tidytext)
Show the code
mc3 <- jsonlite::fromJSON("data/MC3.json")

2.1 Find the nodes and edges:

Show the code
#view(mc3[["nodes"]])
mc3_nodes <- as_tibble(mc3$nodes) %>%
  mutate(country=as.character(country),
         id=as.character(id),
         product_services=as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type=as.character(type)) %>%
  select(id,country, type, revenue_omu, product_services) 
#  group_by(id,country, type, product_services) %>%
#  summarise(count=n(),revenue=sum(revenue_omu))
Show the code
#view(mc3[["links"]])
mc3_edges <- as_tibble(mc3$links) %>%
  distinct() %>%
  mutate(source=as.character(source),
         target=as.character(target),
         type=as.character(type)) %>%
  group_by(source, target, type) %>%
  summarise(weight=n()) %>%
  filter(source!=target) %>%
  ungroup()
Show the code
glimpse(mc3_edges)
Rows: 24,036
Columns: 4
$ source <chr> "1 AS Marine sanctuary", "1 AS Marine sanctuary", "1 Ltd. Liabi…
$ target <chr> "Christina Taylor", "Debbie Sanders", "Angela Smith", "Catherin…
$ type   <chr> "Company Contacts", "Beneficial Owner", "Beneficial Owner", "Co…
$ weight <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
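
A note on the weight column: because distinct() runs before the grouping, each (source, target, type) combination is left with a single row, so weight is 1 wherever the links table holds only those three columns. The toy sketch below illustrates the difference; toy_links and its values are hypothetical, not taken from MC3.json.

```r
library(dplyr)

# Hypothetical toy edge list: the first link is recorded twice
toy_links <- tibble(
  source = c("A Ltd", "A Ltd", "B Corp"),
  target = c("Ann", "Ann", "Bob"),
  type   = c("Beneficial Owner", "Beneficial Owner", "Company Contacts")
)

# distinct() first: every (source, target, type) group holds exactly one row
w_distinct <- toy_links %>%
  distinct() %>%
  count(source, target, type, name = "weight")

# Without distinct(): weight records how often each link appears
w_counted <- toy_links %>%
  count(source, target, type, name = "weight")

w_distinct
w_counted
```

If repeated links are meant to carry information, dropping distinct() is one option; if duplicates are just noise, the weight column can be left out.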

3. Initial Data Exploration:

3.1 Exploring the edge data frame.

Show the code
ggplot(data = mc3_edges, 
       aes(x=type)) +
  geom_bar() +
  xlab("Type")+
  ylab("Count")

Show the code
DT::datatable(mc3_edges)

3.2 Exploring the nodes data frame.

Show the code
ggplot(data = mc3_nodes, 
       aes(x=type)) +
  geom_bar() +
  xlab("Type")+
  ylab("Count")

Show the code
DT::datatable(mc3_nodes)

3.3 Text sensing and node categorization:

3.3.1 Extract text and preprocessing.

  • Extract text
Show the code
mc3_nodes %>%
  select(product_services) %>%
  group_by(product_services) %>%
  summarise(count = n())
# A tibble: 3,244 × 2
   product_services                                                        count
   <chr>                                                                   <int>
 1 (Italian) peeled tomatoes, legumes, vegetables, fruits and canned mush…     1
 2 100 percent Spanish olives; peppers, green, black, and manzanilla stuf…     1
 3 2 or 3-piece containers, twist off caps, easy opening and traditional …     1
 4 8 Cement Mixer Units, Ocean Freight, Air Freight, Project Logistics, C…     1
 5 A chemical science firm with a focus on the development of high purity…     1
 6 A complete range of fully-vertical, Schiffli embroidery manufacturing …     1
 7 A complete range of transportation and logistics services                   1
 8 A customs broker and freight forwarder                                      1
 9 A distributor, importer and exporter of food products to the food reta…     1
10 A freight broker                                                            1
# ℹ 3,234 more rows
  • Word tokenization, with lowercasing and punctuation stripped
Show the code
tidy_nodes <- mc3_nodes %>%
  unnest_tokens(word, product_services, to_lower = TRUE, strip_punct = TRUE)
  • Removing stop words
Show the code
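tidy_stopwords <- tidy_nodes %>%
  anti_join(stop_words, by = "word") %>%
  # also drop the placeholder tokens that carry no meaning here
  filter(!word %in% c("character", "0", "unknown"))
Note:

“character”, “0” and “unknown” also need to be removed, since they act as stop words in this data.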
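
The tokenization and stop-word steps can be tried end to end on a small example. This is a minimal sketch on hypothetical data (toy_nodes is made up for illustration); it assumes the placeholder tokens come from values such as "character(0)" and "Unknown" in product_services.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy nodes, including two placeholder descriptions
toy_nodes <- tibble(
  id = c("n1", "n2", "n3"),
  product_services = c("Fresh and frozen fish", "Unknown", "character(0)")
)

toy_words <- toy_nodes %>%
  # one row per lowercased word, punctuation stripped
  unnest_tokens(word, product_services) %>%
  # remove the standard English stop words shipped with tidytext
  anti_join(stop_words, by = "word") %>%
  # remove the data-specific placeholder tokens
  filter(!word %in% c("character", "0", "unknown"))

# Most frequent remaining words
toy_words %>% count(word, sort = TRUE)
```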

3.3.2 Find nodes whose product services mainly focus on “fish” or “transportation”:

  • Nodes whose product services mainly focus on “fish”
Show the code
# Count how often "fish" appears in each product_services description
mc3_nodes %>%
  mutate(n_fish = str_count(product_services, "fish"))
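
The position of the pattern argument matters in str_count(): called without a pattern, it counts characters rather than matches. A short sketch on hypothetical descriptions (desc is made up for illustration) shows the difference, along with a case-insensitive variant that also catches “Fish”.

```r
library(stringr)

# Hypothetical product descriptions
desc <- c("Automobiles", "Fish and fish products", "Canned fish")

# No pattern supplied: the default pattern "" counts characters
n_chars <- str_count(desc)

# Pattern supplied: counts case-sensitive matches of "fish"
n_fish <- str_count(desc, "fish")

# Case-insensitive matching also counts "Fish"
n_fish_ci <- str_count(desc, regex("fish", ignore_case = TRUE))

n_chars
n_fish
n_fish_ci
```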
  • Nodes whose product services mainly focus on “transportation”, “warehouse”, “trad”, “commercial”:
Show the code
# Count how often "transportation" appears in each product_services description
mc3_nodes %>%
  mutate(n_logistic = str_count(product_services, "transportation"))
  • Check each node’s category based on its product services
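
One possible way to turn the keyword counts into a category is sketched below. The data (toy_nodes), the category labels and the tie-breaking rule are all assumptions made for illustration; the keyword list is the one named in the bullets above, combined into a single regular expression.

```r
library(dplyr)
library(stringr)

# Hypothetical toy nodes
toy_nodes <- tibble(
  id = c("n1", "n2", "n3"),
  product_services = c(
    "Frozen fish products",
    "Ocean freight transportation and warehouse services",
    "Automobiles"
  )
)

# Case-insensitive keyword patterns
fish_pat     <- regex("fish", ignore_case = TRUE)
logistic_pat <- regex("transportation|warehouse|trad|commercial",
                      ignore_case = TRUE)

toy_cat <- toy_nodes %>%
  mutate(
    n_fish     = str_count(product_services, fish_pat),
    n_logistic = str_count(product_services, logistic_pat),
    # Assumed rule: the larger count wins, ties go to "fish"
    category = case_when(
      n_fish > 0 & n_fish >= n_logistic ~ "fish",
      n_logistic > 0                    ~ "logistics",
      TRUE                              ~ "other"
    )
  )

toy_cat
```

The same mutate() call could be applied to mc3_nodes; nodes with a placeholder description would fall into “other” under this rule.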

4. Pattern Analysis & Visualization

4.1 Visualizing temporal patterns for individual entities by heatmap

4.1.1 Transforming the data frame into a matrix

Filter the edges to those with the most frequent hscodes.
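
As a rough sketch of the transformation step, a long data frame can be pivoted into a numeric matrix with one row per entity, which is the shape a heatmap function such as heatmaply() takes. The columns entity, period and value are hypothetical; they do not appear in the MC3 data loaded above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical long-format data: one row per entity and period
toy_long <- tibble(
  entity = c("A Ltd", "A Ltd", "B Corp", "B Corp"),
  period = c("p1", "p2", "p1", "p2"),
  value  = c(3, 0, 1, 5)
)

toy_matrix <- toy_long %>%
  # one column per period, missing combinations filled with 0
  pivot_wider(names_from = period, values_from = value, values_fill = 0) %>%
  # entity names become row names so only numeric columns remain
  column_to_rownames("entity") %>%
  as.matrix()

toy_matrix
```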

4.1.2 Building heatmap